{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Deep Sigma Point Processes\n",
    "\n",
    "In this notebook, we provide a GPyTorch implementation of Deep Sigma Point Processes (DSPPs), as described in Jankowiak et al., 2020 (http://www.auai.org/uai2020/proceedings/339_main_paper.pdf).\n",
    "\n",
    "It will be useful to compare and contrast this notebook with our standard Deep GP notebook, as the computational structure of a Deep GP and a DSPP are quite similar."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import gpytorch\n",
    "import torch\n",
    "from gpytorch.likelihoods import GaussianLikelihood\n",
    "from gpytorch.means import ConstantMean, LinearMean\n",
    "from gpytorch.kernels import ScaleKernel, MaternKernel\n",
    "from gpytorch.variational import VariationalStrategy, BatchDecoupledVariationalStrategy\n",
    "from gpytorch.variational import MeanFieldVariationalDistribution\n",
    "from gpytorch.models.deep_gps.dspp import DSPPLayer, DSPP\n",
    "import gpytorch.settings as settings"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Basic settings\n",
    "\n",
    "In the next cell, we define some basic settings that can be tuned. The only hyperparameter that is DSPP specific is `num_quadrature_sites`, which effectively determines the number of mixtures that the output distribution will have. `hidden_dim` controls the width of the hidden GP layer. The other parameters are standard optimization hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "# this is for running the notebook in our testing framework\n",
    "smoke_test = ('CI' in os.environ)\n",
    "\n",
    "batch_size = 1000                 # Size of minibatch\n",
    "milestones = [20, 150, 300]       # Epochs at which we will lower the learning rate by a factor of 0.1\n",
    "num_inducing_pts = 300            # Number of inducing points in each hidden layer\n",
    "num_epochs = 400                  # Number of epochs to train for\n",
    "initial_lr = 0.01                 # Initial learning rate\n",
    "hidden_dim = 3                    # Number of GPs (i.e., the width) in the hidden layer.\n",
    "num_quadrature_sites = 8          # Number of quadrature sites (see paper for a description of this. 5-10 generally works well).\n",
    "\n",
    "## Modified settings for smoke test purposes\n",
    "num_epochs = num_epochs if not smoke_test else 1"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Loading Data\n",
    "\n",
    "For this example notebook, we'll be using the `bike` UCI dataset used in the paper. Running the next cell downloads a copy of the dataset. We will be using the same normalization, randomization, and train/test splitting scheme as used in the paper, although for this demo notebook we do not use a validation set as we won't be tuning any hyperparameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "torch.Size([13034, 17]) torch.Size([13034]) torch.Size([4345, 17]) torch.Size([4345])\n"
     ]
    }
   ],
   "source": [
    "import urllib.request\n",
    "from scipy.io import loadmat\n",
    "from math import floor\n",
    "\n",
    "\n",
    "if not smoke_test and not os.path.isfile('../bike.mat'):\n",
    "    print('Downloading \\'bike\\' UCI dataset...')\n",
    "    urllib.request.urlretrieve('https://drive.google.com/uc?export=download&id=1pR1H9ee4U89C1y_uYe9qAypKsHs1EL5I', '../bike.mat')\n",
    "\n",
    "if smoke_test:  # this is for running the notebook in our testing framework\n",
    "    X, y = torch.randn(1000, 3), torch.randn(1000)\n",
    "else:\n",
    "    data = torch.Tensor(loadmat('bike.mat')['data'])\n",
    "\n",
    "    # Map features to [-1, 1]\n",
    "    X = data[:, :-1]\n",
    "    X = X - X.min(0)[0]\n",
    "    X = 2.0 * (X / X.max(0)[0]) - 1.0\n",
    "\n",
    "    # Z-score labels\n",
    "    y = data[:, -1]\n",
    "    y -= y.mean()\n",
    "    y /= y.std()\n",
    "\n",
    "shuffled_indices = torch.randperm(X.size(0))\n",
    "X = X[shuffled_indices, :]\n",
    "y = y[shuffled_indices]\n",
    "\n",
    "train_n = int(floor(0.75 * X.size(0)))\n",
    "\n",
    "train_x = X[:train_n, :].contiguous()\n",
    "train_y = y[:train_n].contiguous()\n",
    "test_x = X[train_n:, :].contiguous()\n",
    "test_y = y[train_n:].contiguous()\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    train_x, train_y, test_x, test_y = train_x.cuda(), train_y.cuda(), test_x.cuda(), test_y.cuda()\n",
    "\n",
    "print(train_x.shape, train_y.shape, test_x.shape, test_y.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create PyTorch `DataLoader` objects\n",
    "\n",
    "As we will be training and predicting on minibatches, we use the standard PyTorch `TensorDataset` and `DataLoader` framework to handle getting batches of data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "from torch.utils.data import TensorDataset, DataLoader\n",
    "train_dataset = TensorDataset(train_x, train_y)\n",
    "train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)\n",
    "\n",
    "test_dataset = TensorDataset(test_x, test_y)\n",
    "test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Initialize Hidden Layer Inducing Points\n",
    "\n",
    "Just to match the setup of the paper, we initialize the inducing points for each GP in the hidden layer by using kmeans clustering. The `DSPPHiddenLayer` class as defined below can also take `num_inducing=300, inducing_points=None` as arguments to randomly initialize the inducing points. However, we find that using kmeans to initialize can improve optimization in most cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "from scipy.cluster.vq import kmeans2\n",
    "\n",
    "# Use k-means to initialize inducing points (only helpful for the first layer)\n",
    "inducing_points = (train_x[torch.randperm(min(1000 * 100, train_n))[0:num_inducing_pts], :])\n",
    "inducing_points = inducing_points.clone().data.cpu().numpy()\n",
    "inducing_points = torch.tensor(kmeans2(train_x.data.cpu().numpy(),\n",
    "                               inducing_points, minit='matrix')[0])\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    inducing_points = inducing_points.cuda()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create The `DSPPHiddenLayer` Class\n",
    "\n",
    "The next cell is the most important in the notebook. It will likely be instructive to compare the code below to the analogous code cell in our Deep GP notebook, as they are essentially the same. The only difference is some code at the start to handle the fact that we may pass in prespecified inducing point locations, rather than always initializing them randomly.\n",
    "\n",
    "Regardless, the best way to think of a DSPP (or DGP) hidden layer class is as a standard GPyTorch variational GP class that has two key aspects:\n",
    "\n",
    "1. It has a batch shape equal to `output_dims`. In other words, the way we handle a layer of multiple GPs is with a batch dimension. This means that inducing_points, kernel hyperparameters, etc should all have `batch_shape=torch.Size([output_dims])`.\n",
    "2. It extends `DSPPLayer` rather than ApproximateGP.\n",
    "\n",
    "These are really the only two differences. A DSPPLayer / DGPLayer will still define a variational distribution and strategy, a prior mean and covariance function, and forward is still responsible for returning the prior. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "class DSPPHiddenLayer(DSPPLayer):\n",
    "    def __init__(self, input_dims, output_dims, num_inducing=300, inducing_points=None, mean_type='constant', Q=8):\n",
    "        if inducing_points is not None and output_dims is not None and inducing_points.dim() == 2:\n",
    "            # The inducing points were passed in, but the shape doesn't match the number of GPs in this layer.\n",
    "            # Let's assume we wanted to use the same inducing point initialization for each GP in the layer,\n",
    "            # and expand the inducing points to match this.\n",
    "            inducing_points = inducing_points.unsqueeze(0).expand((output_dims,) + inducing_points.shape)\n",
    "            inducing_points = inducing_points.clone() + 0.01 * torch.randn_like(inducing_points)\n",
    "        if inducing_points is None:\n",
    "            # No inducing points were specified, let's just initialize them randomly.\n",
    "            if output_dims is None:\n",
    "                # An output_dims of None implies there is only one GP in this layer\n",
    "                # (e.g., the last layer for univariate regression).\n",
    "                inducing_points = torch.randn(num_inducing, input_dims)\n",
    "            else:\n",
    "                inducing_points = torch.randn(output_dims, num_inducing, input_dims)\n",
    "        else:\n",
    "            # Get the number of inducing points from the ones passed in.\n",
    "            num_inducing = inducing_points.size(-2)\n",
    "        \n",
    "        # Let's use mean field / diagonal covariance structure.\n",
    "        variational_distribution = MeanFieldVariationalDistribution(\n",
    "            num_inducing_points=num_inducing,\n",
    "            batch_shape=torch.Size([output_dims]) if output_dims is not None else torch.Size([])\n",
    "        )\n",
    "        \n",
    "        # Standard variational inference.\n",
    "        variational_strategy = VariationalStrategy(\n",
    "            self,\n",
    "            inducing_points,\n",
    "            variational_distribution,\n",
    "            learn_inducing_locations=True\n",
    "        )\n",
    "\n",
    "        batch_shape = torch.Size([]) if output_dims is None else torch.Size([output_dims])\n",
    " \n",
    "        super(DSPPHiddenLayer, self).__init__(variational_strategy, input_dims, output_dims, Q)\n",
    "    \n",
    "        if mean_type == 'constant':\n",
    "            # We'll use a constant mean for the final output layer.\n",
    "            self.mean_module = ConstantMean(batch_shape=batch_shape)\n",
    "        elif mean_type == 'linear':\n",
    "            # As in Salimbeni et al. 2017, we find that using a linear mean for the hidden layer improves performance.\n",
    "            self.mean_module = LinearMean(input_dims, batch_shape=batch_shape)\n",
    "            \n",
    "        self.covar_module = ScaleKernel(MaternKernel(batch_shape=batch_shape, ard_num_dims=input_dims),\n",
    "                                        batch_shape=batch_shape, ard_num_dims=None)\n",
    "\n",
    "    def forward(self, x, mean_input=None, **kwargs):\n",
    "        mean_x = self.mean_module(x)\n",
    "        covar_x = self.covar_module(x)\n",
    "        return gpytorch.distributions.MultivariateNormal(mean_x, covar_x)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create the `DSPP` Class\n",
    "\n",
    "The below creates a DSPP container that is virtually identical to the one used in the DGP setting. All it is responsible for is insantiating the layers of the DSPP (in this case, one hidden layer and one output layer), and then defining a `forward` method that passes data through both layers.\n",
    "\n",
    "As in the DGP example notebook, we also define a `predict` method purely for convenience that takes a `DataLoader` and returns predictions for every example in that DataLoader."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "class TwoLayerDSPP(DSPP):\n",
    "    def __init__(self, train_x_shape, inducing_points, num_inducing, hidden_dim=3, Q=3):\n",
    "        hidden_layer = DSPPHiddenLayer(\n",
    "            input_dims=train_x_shape[-1],\n",
    "            output_dims=hidden_dim,\n",
    "            mean_type='linear',\n",
    "            inducing_points=inducing_points,\n",
    "            Q=Q,\n",
    "        )\n",
    "        last_layer = DSPPHiddenLayer(\n",
    "            input_dims=hidden_layer.output_dims,\n",
    "            output_dims=None,\n",
    "            mean_type='constant',\n",
    "            inducing_points=None,\n",
    "            num_inducing=num_inducing,\n",
    "            Q=Q,\n",
    "        )\n",
    "\n",
    "        likelihood = GaussianLikelihood()\n",
    "\n",
    "        super().__init__(Q)\n",
    "        self.likelihood = likelihood\n",
    "        self.last_layer = last_layer\n",
    "        self.hidden_layer = hidden_layer\n",
    "\n",
    "    def forward(self, inputs, **kwargs):\n",
    "        hidden_rep1 = self.hidden_layer(inputs, **kwargs)\n",
    "        output = self.last_layer(hidden_rep1, **kwargs)\n",
    "        return output\n",
    "\n",
    "    def predict(self, loader):\n",
    "        with settings.fast_computations(log_prob=False, solves=False), torch.no_grad():\n",
    "            mus, variances, lls = [], [], []\n",
    "            for x_batch, y_batch in loader:\n",
    "                preds = self.likelihood(self(x_batch, mean_input=x_batch))\n",
    "                mus.append(preds.mean.cpu())\n",
    "                variances.append(preds.variance.cpu())\n",
    "                \n",
    "                # Compute test log probability. The output of a DSPP is a weighted mixture of Q Gaussians,\n",
    "                # with the Q weights specified by self.quad_weight_grid. The below code computes the log probability of each\n",
    "                # test point under this mixture.\n",
    "                \n",
    "                # Step 1: Get log marginal for each Gaussian in the output mixture.\n",
    "                base_batch_ll = self.likelihood.log_marginal(y_batch, self(x_batch))\n",
    "                \n",
    "                # Step 2: Weight each log marginal by its quadrature weight in log space.\n",
    "                deep_batch_ll = self.quad_weights.unsqueeze(-1) + base_batch_ll\n",
    "                \n",
    "                # Step 3: Take logsumexp over the mixture dimension, getting test log prob for each datapoint in the batch.\n",
    "                batch_log_prob = deep_batch_ll.logsumexp(dim=0)\n",
    "                lls.append(batch_log_prob.cpu())\n",
    "\n",
    "        return torch.cat(mus, dim=-1), torch.cat(variances, dim=-1), torch.cat(lls, dim=-1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "TwoLayerDSPP(\n",
       "  (likelihood): GaussianLikelihood(\n",
       "    (noise_covar): HomoskedasticNoise(\n",
       "      (raw_noise_constraint): GreaterThan(1.000E-04)\n",
       "    )\n",
       "  )\n",
       "  (last_layer): DSPPHiddenLayer(\n",
       "    (variational_strategy): VariationalStrategy(\n",
       "      (_variational_distribution): MeanFieldVariationalDistribution()\n",
       "    )\n",
       "    (mean_module): ConstantMean()\n",
       "    (covar_module): ScaleKernel(\n",
       "      (base_kernel): MaternKernel(\n",
       "        (raw_lengthscale_constraint): Positive()\n",
       "      )\n",
       "      (raw_outputscale_constraint): Positive()\n",
       "    )\n",
       "  )\n",
       "  (hidden_layer): DSPPHiddenLayer(\n",
       "    (variational_strategy): VariationalStrategy(\n",
       "      (_variational_distribution): MeanFieldVariationalDistribution()\n",
       "    )\n",
       "    (mean_module): LinearMean()\n",
       "    (covar_module): ScaleKernel(\n",
       "      (base_kernel): MaternKernel(\n",
       "        (raw_lengthscale_constraint): Positive()\n",
       "      )\n",
       "      (raw_outputscale_constraint): Positive()\n",
       "    )\n",
       "  )\n",
       ")"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model = TwoLayerDSPP(\n",
    "    train_x.shape,\n",
    "    inducing_points,\n",
    "    num_inducing=num_inducing_pts,\n",
    "    hidden_dim=hidden_dim,\n",
    "    Q=num_quadrature_sites\n",
    ")\n",
    "\n",
    "if torch.cuda.is_available():\n",
    "    model.cuda()\n",
    "\n",
    "model.train()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gpytorch.mlls import DeepPredictiveLogLikelihood\n",
    "\n",
    "adam = torch.optim.Adam([{'params': model.parameters()}], lr=initial_lr, betas=(0.9, 0.999))\n",
    "sched = torch.optim.lr_scheduler.MultiStepLR(adam, milestones=milestones, gamma=0.1)\n",
    "\n",
    "\n",
    "# The \"beta\" parameter here corresponds to \\beta_{reg} from the paper, and represents a scaling factor on the KL divergence\n",
    "# portion of the loss.\n",
    "objective = DeepPredictiveLogLikelihood(model.likelihood, model, num_data=train_n, beta=0.05)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the Model\n",
    "\n",
    "Below is a standard minibatch training loop."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tqdm\n",
    "\n",
    "epochs_iter = tqdm.notebook.tqdm(range(num_epochs), desc=\"Epoch\")\n",
    "\n",
    "for i in epochs_iter:\n",
    "    minibatch_iter = tqdm.notebook.tqdm(train_loader, desc=\"Minibatch\", leave=False)\n",
    "    for x_batch, y_batch in minibatch_iter:\n",
    "        adam.zero_grad()\n",
    "        output = model(x_batch)\n",
    "        loss = -objective(output, y_batch)\n",
    "        loss.backward()\n",
    "        adam.step()\n",
    "    sched.step()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Make Predictions, compute RMSE and Test NLL"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "RMSE:  0.04274941563606262 Test NLL:  -1.690010404586792\n"
     ]
    }
   ],
   "source": [
    "model.eval()\n",
    "means, vars, ll = model.predict(test_loader)\n",
    "weights = model.quad_weights.unsqueeze(-1).exp().cpu()\n",
    "# `means` currently contains the predictive output from each Gaussian in the mixture.\n",
    "# To get the total mean output, we take a weighted sum of these means over the quadrature weights.\n",
    "rmse = ((weights * means).sum(0) - test_y.cpu()).pow(2.0).mean().sqrt().item()\n",
    "ll = ll.mean().item()\n",
    "\n",
    "print('RMSE: ', rmse, 'Test NLL: ', -ll)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
